Emotion recognition through speech has many potential applications; 
however, the challenge lies in achieving a high emotion recognition rate 
with limited resources or in the presence of interference such as noise. In this 
paper, we explore the possibility of improving speech emotion recognition by 
utilizing voice activity detection (VAD). The emotional voice 
data from the Berlin Emotion Database (EMO-DB) and a custom-made 
database, the LQ Audio Dataset, are first preprocessed by VAD before feature 
extraction. The features are then passed to a deep neural network for 
classification. We have chosen MFCC to be the sole 
determinant feature. Comparing the results obtained with and without VAD, we 
found that VAD improved the recognition rate of 5 emotions 
(happy, angry, sad, fear, and neutral) by 3.7 percentage points when recognizing clean 
signals, while using VAD when training a network with both 
clean and noisy signals improved our previous results by about 50 percentage points. 

Copyright © 2019 Institute of Advanced Engineering and Science. 

All rights reserved. 


Corresponding Author: 

Teddy Surya Gunawan, 

Department of Electrical and Computer Engineering, 
International Islamic University Malaysia, 

Jalan Gombak, 51300 Selangor, Malaysia. 

Email: tsgunawan@iium.edu.my 


1. INTRODUCTION 

Speech emotion recognition (SER) is the ability of a system to recognize human emotions from 
speech. Typically, this can be performed in two ways: textual/context analysis, in which the speaker’s words 
are first transcribed into text and then analyzed linguistically, or analysis of the speech signal patterns 
themselves. The latter is the focus of this paper. The challenge of SER lies in achieving a 
high recognition rate with limited resources (time, processing power) or in the presence of interference 
(noisy background). 

The recent trend in SER is to employ machine learning so that the system learns the 
speech emotion feature patterns. Popular machine learning models include the support vector 
machine (SVM) [1-3], the Hidden Markov Model (HMM) [4, 5], and artificial neural networks in many forms, 
such as convolutional neural networks (CNN) [6, 7] or recurrent neural networks (RNN) [8]. A neural network 
can be considered deep when it uses more than a single hidden layer. Although the methodology may vary, the 
general flow of SER can be viewed in Figure 1. 

The speech signals used to train a neural network come from a dataset in which emotional speech 
is recorded and labeled. Typically, a clean (minimum noise) dataset is recorded in a studio with 
elicited emotional speech. However, there are also noisy audio datasets that imitate the conditions of natural 
environments. To handle this, a preprocessing step is usually performed before training. 



Figure 1. A typical speech emotion recognition algorithm 


Voice activity detection (VAD) is one prominent method of pre-processing speech data. A VAD 
detects the presence or absence of human voice in a signal [9]. Research on practical applications 
of VAD and how to implement them is plentiful: for example, [10] proposes a 
real-time VAD smartphone application, and [11] proposes an algorithm for assisting 
people with hearing problems. The personality identification research in [12] was also improved by utilizing VAD. 

In this research, the main use of VAD is to improve the results of SER. Similar research 
using VAD to improve SER, such as [13], yielded an accuracy rate of 96.97%. However, this was achieved for 
only 3 emotions: angry, neutral, and sad. The research conducted in [14] incorporated the arousal-valence 
theory and obtained a recognition rate of 95% for 4 emotions: happy, fear, anger, and sad. Other research, 
such as [15], investigated the effects of VAD by adding 5 different noise types (voice babble, factory noise, 
HF radio channel, F-16 fighter jets, and Volvo 340) to the clean audio signals. Their proposed VAD system 
showed a mean improvement of 5.54% across the 5 noises when tested on the Berlin Emotion 
Database (EMO-DB). 

While research on VAD in SER has already been performed, there is still room for improvement in 
terms of accuracy and the number of emotions detected. Hence, in this paper we contribute to the field in two 
ways. The first is presenting the results of our proposed SER system with VAD and benchmarking them against other 
papers using a similar methodology. The second is investigating the effects of VAD when mixing a clean 
emotion dataset with a noisy one for training and testing a deep neural network. The rest of the paper is organized as 
follows. Section 2 briefly outlines the steps taken to perform SER. Section 3 presents the results obtained, 
together with a discussion and benchmarking. Section 4 concludes by summarizing the content of this paper. 


2. RESEARCH METHOD 

The research conducted in this paper continues our previous research in [16], now with the 
addition of VAD. Our system performs SER in 4 steps. VAD is performed in the preprocessing stage, 
where the audio files are passed through Sohn’s VAD algorithm to remove the silent segments. Afterwards, the 
MFCCs of the speech signals are extracted as features. The MFCCs are then used to train and test the 
deep neural network. For this system, we adopt 2 datasets: the Berlin Emotion Database (EMO-DB) and a 
custom-made low-quality database. An overview of the system can be viewed in Figure 2. 



Figure 2. Proposed SER with VAD methodology 

As mentioned previously, the investigation is divided into two parts. The first investigates 
the effects of VAD on a clean signal dataset by comparing the recognition results obtained with and without 
VAD. The second part tests the VAD on noisy speech input. The following subsections elaborate 
on each step in more detail. 

2.1. Voice activity detection pre-processing 

A typical VAD algorithm is shown in Figure 3. For this project, VAD is performed using 
Sohn’s VAD algorithm, which integrates decision-directed parameter estimation and an HMM-based 
hang-over scheme to improve the results [17]. It is implemented as ‘vadsohn’ in the VOICEBOX toolbox 
[18]. The VAD output is a probability decision on a frame-by-frame basis, with 
a threshold probability of 0.70, as shown in Figure 4. 

Each audio file is then reconstructed at its original sampling frequency of 16 kHz by concatenating only the 
segments in which voice was detected, and the result is passed to the feature extractor. Figure 5 shows an example of 
VAD pre-processing performed on a sample audio file from each of the EMO-DB and the LQ Emo Dataset. 
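
The thresholding and concatenation step described above can be illustrated with a minimal Python sketch. It is not the `vadsohn` implementation: the function name `keep_voiced_frames`, the fixed frame length, and the choice to keep frames whose probability is greater than or equal to the threshold are assumptions made for illustration only.

```python
def keep_voiced_frames(signal, frame_probs, frame_len, threshold=0.70):
    """Concatenate only the frames whose voice probability meets the threshold.

    signal      -- list of audio samples
    frame_probs -- one voice probability per frame (hypothetical VAD output)
    frame_len   -- number of samples per frame (assumed fixed, no overlap)
    """
    voiced = []
    for i, prob in enumerate(frame_probs):
        if prob >= threshold:  # assumption: a frame at exactly 0.70 counts as voiced
            start = i * frame_len
            voiced.extend(signal[start:start + frame_len])
    return voiced


# Toy example: 4 frames of 3 samples each; frames 1 and 3 fall below the threshold.
signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
frame_probs = [0.95, 0.10, 0.80, 0.40]
trimmed = keep_voiced_frames(signal, frame_probs, frame_len=3)
print(trimmed)  # [1, 2, 3, 7, 8, 9]
```

Because the retained samples are simply concatenated, the output keeps the original sampling frequency and only becomes shorter.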

Figure 3. A typical voice activity detection algorithm 

Figure 4. Sohn’s VAD Algorithm 

Figure 5. VAD pre-processing step, (a) EMO-DB, and (b) LQ Audio Database 

2.2. Mel-frequency cepstral coefficients (MFCCs) feature extraction 

For the reasons outlined in [19], we chose Mel-Frequency Cepstral Coefficients (MFCCs) to be the sole feature for 
classification. MFCCs use a non-linear frequency scale, i.e. the mel scale, based on auditory perception. A 
mel is a unit of measure of the perceived pitch or frequency of a tone. Equation (1) can be used to convert the frequency 
scale to the mel scale. 

$$f_{mel} = 1127 \ln\left(1 + \frac{f_{Hz}}{700}\right) \qquad (1)$$

where $f_{mel}$ is the frequency in mels and $f_{Hz}$ is the normal frequency in Hz. MFCCs are often calculated using 
a filter bank of M filters, in which each filter has a triangular shape and is spaced uniformly on the mel scale, 
as shown in (2). 


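$$H_m[k] = \begin{cases} 0, & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]}, & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]}, & f[m] \le k \le f[m+1] \\ 0, & k > f[m+1] \end{cases} \qquad (2)$$

where $m = 0, 1, \ldots, M-1$. The log-energy mel spectrum is then calculated as follows: 

$$S[m] = \ln\left[\sum_{k=0}^{N-1} |X[k]|^2 H_m[k]\right], \quad m = 0, 1, \ldots, M-1 \qquad (3)$$

where $X[k]$ is the N-point discrete Fourier transform (DFT) of a speech input $x[n]$. 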
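
As an illustration of (1) to (3), the following Python sketch converts between Hz and mels, evaluates a triangular filter weight, and computes log mel energies from a power spectrum. It is a simplified sketch, not the VOICEBOX `melcepst` routine: the constants 1127 and 700 are the commonly used mel-scale values, and the helper names, the bin-rounding rule, and the small floor inside the logarithm are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Equation (1): frequency in Hz to mels (commonly used constants assumed)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of equation (1)."""
    return 700.0 * (math.exp(f_mel / 1127.0) - 1.0)


def filter_edges(num_filters, n_fft, sample_rate):
    """Boundary bins spaced uniformly on the mel scale.

    Returns M + 2 bins; filter m uses edges[m], edges[m + 1], edges[m + 2],
    which play the roles of f[m-1], f[m], f[m+1] in equation (2).
    """
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    return [int(math.floor(n_fft * mel_to_hz(m) / sample_rate)) for m in mels]


def triangular_weight(k, lo, mid, hi):
    """Equation (2): weight of bin k for a triangle rising lo->mid, falling mid->hi."""
    if k < lo or k > hi:
        return 0.0
    if k <= mid:
        return (k - lo) / (mid - lo) if mid > lo else 1.0
    return (hi - k) / (hi - mid)


def log_mel_energies(power_spectrum, edges):
    """Equation (3): log of the filter-weighted sum of |X[k]|^2 for each filter."""
    energies = []
    for m in range(len(edges) - 2):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        total = sum(p * triangular_weight(k, lo, mid, hi)
                    for k, p in enumerate(power_spectrum))
        energies.append(math.log(max(total, 1e-12)))  # floor avoids log(0)
    return energies


# Toy example: a flat power spectrum with 33 bins (64-point FFT, 16 kHz).
edges = filter_edges(num_filters=4, n_fft=64, sample_rate=16000)
S = log_mel_energies([1.0] * 33, edges)
print(len(S))  # 4
```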

Although the traditional cepstrum uses the inverse discrete Fourier transform (IDFT), the mel frequency 
cepstrum is normally implemented using the discrete cosine transform (DCT), since $S[m]$ is even, as shown in (4): 

$$c[n] = \sum_{m=0}^{M-1} S[m] \cos\left[\frac{\pi n \left(m + \frac{1}{2}\right)}{M}\right], \quad n = 0, 1, \ldots, M-1 \qquad (4)$$
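
The cosine transform in (4) can be written directly as a short Python sketch. The function name `mel_cepstrum` and the `num_coeffs` parameter are hypothetical; the sketch assumes the log mel energies S[m] have already been computed and applies no scaling or liftering.

```python
import math


def mel_cepstrum(log_energies, num_coeffs):
    """Equation (4): c[n] = sum over m of S[m] * cos(pi * n * (m + 1/2) / M)."""
    M = len(log_energies)
    return [
        sum(s * math.cos(math.pi * n * (m + 0.5) / M)
            for m, s in enumerate(log_energies))
        for n in range(num_coeffs)
    ]


# Toy example: a constant log spectrum puts all of its energy into c[0].
c = mel_cepstrum([2.0, 2.0, 2.0, 2.0], num_coeffs=3)
print(round(c[0], 6))  # 8.0
```

For a constant input, every coefficient with n greater than 0 cancels to (numerically) zero, which is a quick sanity check on the transform.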

MFCC is one of the most popular speech features used for SER, as shown in [1-7]. 
This common usage enables us to benchmark the results. For this step, the melcepst module from VOICEBOX 
is used to extract the MFCCs. We have used the default parameters, namely a Hamming window in the time 
domain with triangular filters in the mel domain. 

2.3. Deep feedforward neural network classification 

In this paper, we utilize a deep neural network for the system to learn and classify the 
emotions. The MATLAB neural network pattern recognition tool is used with a 70/15/15 training/validation/testing 
ratio. As the purpose of this paper is to investigate the effects of VAD on the recognition rate, the network 
size is kept constant at 2 hidden layers with 10 neurons each, using a varying number of MFCCs as the 
features, as shown in Figure 6. For each iteration, the training/validation/testing data is randomized. 
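
The randomized 70/15/15 partition can be sketched in a few lines of Python. This is only an illustration of the idea and not the MATLAB tool’s own data division routine; the function name `split_indices` and its parameters are hypothetical.

```python
import random


def split_indices(num_samples, train=0.70, val=0.15, seed=None):
    """Shuffle sample indices and cut them into training/validation/testing sets."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    n_train = round(num_samples * train)
    n_val = round(num_samples * val)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])  # the remainder is the testing set


# With 300 samples the split sizes are 210, 45 and 45.
train_idx, val_idx, test_idx = split_indices(300, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))  # 210 45 45
```

Calling the function with a different seed for each run reproduces the per-iteration randomization described above.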



2.4. Dataset 

This research uses the Berlin Emotion Database (EMO-DB) [20] as the primary dataset. The 
EMO-DB contains simulated emotional voices from 5 female and 5 male actors speaking 10 German utterances (5 
short and 5 longer sentences) in 7 emotions. For this experiment, we use only 60 samples for each emotion so that the 
training data is balanced across emotions. As we are analyzing up to 5 emotions, the total number of voice lines used is 
300 samples. 

For the second part of the research, we reused the Low Quality Emotional dataset (LQ Emo Dataset) 
collected in [16], which contains 148 low-quality emotional speech recordings in .ogg format. The data collection was 
performed using a common mobile phone recorder in a noisy environment, and the recordings were transmitted as WhatsApp 
audio messages. WhatsApp uses the Opus codec, a lossy audio coding format, for voice media streams at 
either 8 kHz or 16 kHz [21]. The database contains short utterances in the 4 emotions that have the most 
potential applications in fields such as customer service: happy, angry, sad, and neutral. The audio 
files contain 5 daily conversational lines in English, such as ‘hello, good morning’, and the average length of 
the files is 2.26 seconds. 


3. RESULTS AND ANALYSIS 

The SER is performed using MATLAB R2018b on an Intel(R) Core i5-7200U CPU @ 2.50 GHz. The 
number of MFCCs is varied from 1 to 30. Each investigation is repeated 3-5 times and the results are averaged 
where necessary to ensure accuracy. 

3.1. Experiments of VAD in EMO-DB 

Following the methodology, we first evaluate the SER accuracy for 5 emotions. The chosen 
emotions are happy, angry, sad, fear, and neutral. 300 emotional voice audio signals from the clean dataset 
(EMO-DB) are passed to the system for MFCC feature extraction without pre-processing. The obtained 
emotion recognition rate is 88%, as shown in Figure 7(a). 

The experimental setup is then repeated, now with VAD pre-processing. All 300 files are passed 
through the vadsohn VAD, removing any segments without detected voice. Although the dataset is 
considered to have minimum noise, the VAD typically trims the beginning and the end of the emotion audio. 
The processed data is then passed to the feature extraction and finally the classifier. The obtained recognition 
rate is 91.7%, as shown in Figure 7(b), an improvement of 3.7 percentage points. The best results were obtained when the number of MFCCs 
is set to 30 coefficients. 




Figure 7. Emotion recognition results for clean dataset, (a) Without VAD, and (b) with VAD 


Benchmarking against other papers that use a similar methodology with VAD, our results are not as 
accurate as those obtained in [13] or [14]; however, considering that we include more emotions to 
detect, this recognition rate is acceptable. The system in [13] detects only 
angry, neutral, and sad, and [14] detects happiness, anger, sadness, and fear, while our study recognizes up to 5 
emotions: happiness, anger, sadness, neutral, and fear. As the number of emotions detected increases, the 
complexity of the system also increases. 

Another aspect to consider is the training/validation/testing ratio. The research in [13] used a 
ratio of 80/10/10 compared to our 70/15/15. Theoretically, the accuracy rate increases as the training size 
increases. 

One last factor to consider is that [13] split every file into non-overlapping 20-millisecond chunks, giving 
vectors of length 320 (16 kHz * 20 ms), while our methodology takes the whole audio file and averages the 
obtained MFCCs, which limits the amount of training and testing data. Ideally, for training a deep neural 
network, more samples are better; hence, in a future study we plan to add more sample files for training. 
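
The difference in the amount of training data between the two approaches can be seen in a small Python sketch. The function names are hypothetical, and the sketch only contrasts the bookkeeping (many fixed-length chunks per file versus one averaged feature vector per file); it does not reproduce either system.

```python
def split_into_chunks(signal, sample_rate=16000, chunk_ms=20):
    """Cut a signal into non-overlapping chunks; 16 kHz * 20 ms = 320 samples."""
    chunk_len = sample_rate * chunk_ms // 1000
    n_chunks = len(signal) // chunk_len  # a trailing partial chunk is dropped
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def average_feature_vectors(frames):
    """Collapse per-frame feature vectors into a single vector per file."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


# One second of audio at 16 kHz yields 50 training vectors when chunked...
one_second = [0.0] * 16000
chunks = split_into_chunks(one_second)
print(len(chunks), len(chunks[0]))  # 50 320

# ...but only one vector when frame-level features are averaged per file.
print(average_feature_vectors([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```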

3.2. VAD Experiment on Both EMO-DB and LQ Emo Dataset 

In the second part of the investigation, we examined the effects of utilizing VAD when mixing 
clean and noisy datasets. In our previous study [16], we found that the accuracy rate was poor when 
noisy and clean data were mixed, yielding an accuracy rate of 23.3%, as shown in Figure 8(a). 

240 lines of 4 emotions (happy, angry, sad, and neutral) from the EMO-DB are mixed with 120 
noisy lines for network training and testing. A sample of the VAD process using the LQ Audio Dataset can 
be viewed in Figure 5(b). The number of MFCCs is fixed at 30, following the best results obtained in 3.1. By 
applying VAD to both, we achieved a greatly improved recognition rate of 73.6%, as displayed in Figure 8(b). 
This implies that VAD can be used not only to process noisy signals, but also when dealing with a mix of 
low-quality and clean audio. This result is consistent with the results obtained by [15], 
which show that VAD is particularly useful when dealing with noisy signals, or when mixing 
clean and noisy signals. 




Figure 8. Emotion recognition results from LQ audio database, (a) Results without VAD [16], (b) Results 
with VAD 


Another factor to consider is that the EMO-DB dataset is in German while the LQ Audio Dataset 
is in English. The large improvement obtained with VAD may be attributed to the fact that languages 
convey information at different rates. According to [22], among 7 languages, German is one of the 
slowest in terms of syllables per second, while English is around the middle. There is also the 
consideration of different slang or dialects for each speaker. By applying VAD, unnecessary information 
such as gaps between words can be filtered out, and the SER system can focus on the important signals only. 


4. CONCLUSION 

To achieve more accurate speech emotion recognition, this study has employed Sohn’s VAD 
algorithm in the pre-processing step. We have tested the algorithm against two datasets: a clean dataset from 
EMO-DB and a noisy dataset created in our previous study. From the results obtained, we have shown that 
by using VAD, the accuracy rate of SER improved by 3.7 percentage points for 5 emotions when testing against a clean 
dataset. The second part of the study tested the usage of VAD when mixing clean and noisy 
datasets. From the results obtained, we have found that the recognition rate greatly improved over our 
previous study, by about 50 percentage points (from 23.3% to 73.6%). Our results encourage future studies to include VAD in the 
pre-processing step, especially when dealing with mixed datasets. 

There are a few shortcomings in this research. The first is the limited amount of data for training and 
testing. Our methodology takes the average MFCC of a single speech sample rather than taking multiple 
frames per sample, hence limiting the total number of samples. The next is the usage of the LQ Emo Dataset. 
The dataset has fewer speech samples than the number of EMO-DB samples used. Ideally, an equal amount of 
each dataset would be used to train the network. Nonetheless, both limitations are mitigated by repeating the 
experiment 3-5 times to ensure that the errors are kept to a minimum. Future studies can further expand the 
number of emotions detected, as the EMO-DB has 7 types of emotion recorded. For the LQ Emo Dataset, 
additional audio is planned to be recorded in different languages. 